An article from Google Research in which the authors propose a universal sentence encoder that, compared with traditional word embeddings, achieves better accuracy on a number of different NLP tasks and can be used for transfer learning.
paper link
code link
Introduction
In this paper, we present two models for producing sentence embeddings that demonstrate good transfer to a number of other NLP tasks.
```python
import tensorflow as tf
import tensorflow_hub as hub

embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder/2")
embeddings = embed([
    "The quick brown fox jumps over the lazy dog.",
    "I am a sentence for which I would like to get its embedding"])

with tf.Session() as session:
    # Initialize the module's variables and lookup tables before running it.
    session.run([tf.global_variables_initializer(), tf.tables_initializer()])
    print(session.run(embeddings))

# The following are example embedding outputs of 512 dimensions per sentence
# Embedding for: The quick brown fox jumps over the lazy dog.
# [-0.016987282782793045, -0.008949815295636654, -0.0070627182722091675, ...]
# Embedding for: I am a sentence for which I would like to get its embedding.
# [0.03531332314014435, -0.025384284555912018, -0.007880025543272495, ...]
```
_This module is about 1GB. Depending on your network speed, it might take a while to load the first time you instantiate it. After that, loading the model should be faster as modules are cached by default (learn more about caching). Further, once a module is loaded to memory, inference time should be relatively fast._
The paper proposes two Universal Sentence Encoders based on different network architectures:
Our two encoders have different design goals. One based on the transformer architecture targets high accuracy at the cost of greater model complexity and resource consumption. The other targets efficient inference with slightly reduced accuracy.
Encoders
Transformer
For details of the Transformer architecture, see: https://helicqin.github.io/2018/03/30/Attention%20is%20all%20you%20need/
This architecture achieves the best transfer learning accuracy, but as sentence length grows, its compute time and memory consumption increase sharply.
Deep Averaging Network (DAN)
The second encoding model makes use of a deep averaging network (DAN) (Iyyer et al., 2015) whereby input embeddings for words and bi-grams are first averaged together and then passed through a feedforward deep neural network (DNN) to produce sentence embeddings.
The main advantage of this architecture is that compute time scales linearly with sentence length.
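A minimal numpy sketch of the DAN idea described above (not the released model; the layer sizes, activations, and random placeholder embeddings are illustrative assumptions): word and bi-gram embeddings are averaged, then fed through a feedforward network to produce the sentence embedding.

```python
import numpy as np

# Minimal DAN sketch: average word and bi-gram embeddings, then pass the average
# through a feedforward network. All sizes and weights are illustrative placeholders,
# not the parameters of the released encoder.
rng = np.random.default_rng(0)
embed_dim, hidden_dim, out_dim = 300, 512, 512

def dan_sentence_embedding(word_embs, bigram_embs, params):
    """word_embs: (num_words, embed_dim); bigram_embs: (num_bigrams, embed_dim)."""
    avg = np.concatenate([word_embs, bigram_embs]).mean(axis=0)   # averaging step
    hidden = np.tanh(avg @ params["W1"] + params["b1"])           # feedforward layer 1
    return np.tanh(hidden @ params["W2"] + params["b2"])          # sentence embedding

params = {
    "W1": rng.normal(scale=0.1, size=(embed_dim, hidden_dim)), "b1": np.zeros(hidden_dim),
    "W2": rng.normal(scale=0.1, size=(hidden_dim, out_dim)),   "b2": np.zeros(out_dim),
}
words = rng.normal(size=(6, embed_dim))     # stand-in embeddings for a 6-word sentence
bigrams = rng.normal(size=(5, embed_dim))   # stand-in embeddings for its 5 bi-grams
print(dan_sentence_embedding(words, bigrams, params).shape)  # (512,)
```

Because the per-sentence work is one average plus a fixed number of matrix multiplications, the cost grows linearly with sentence length.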
Transfer Learning Models
- For text classification tasks, the output of either sentence encoder is used as the input to the classification model;
- For semantic similarity tasks, similarity is computed directly from the sentence encoders' output vectors:
As shown in Eq. 1, we first compute the cosine similarity of the two sentence embeddings and then use arccos to convert the cosine similarity into an angular distance. We find that using a similarity based on angular distance performs better on average than raw cosine similarity.
$$\mathrm{sim}(u, v) = 1 - \arccos\left(\frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}\right) / \pi \qquad (1)$$
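A small sketch of Eq. (1), assuming the two sentence embeddings are available as numpy vectors (the variable names and placeholder vectors are illustrative):

```python
import numpy as np

def angular_similarity(u, v):
    """Eq. (1): cosine similarity converted to an angular distance, mapped to [0, 1]."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    cos = np.clip(cos, -1.0, 1.0)          # guard against floating-point drift outside [-1, 1]
    return 1.0 - np.arccos(cos) / np.pi

# e.g., with two 512-dimensional embeddings like those returned by the module above
u, v = np.random.rand(512), np.random.rand(512)   # placeholder vectors
print(angular_similarity(u, v))
```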
Baselines
The paper constructs two kinds of baselines:
- a baseline that uses word2vec embeddings
- a baseline that does not use any pretrained model
Combined Transfer Models
The paper also tries fusing the sentence-level and word-level models; the experimental results are reported below.
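A rough sketch of one way such a combination could look, assuming it amounts to concatenating the sentence-level embedding with a word-level sentence representation before the task model (the averaging choice and all dimensions are assumptions for illustration, not the paper's exact setup):

```python
import numpy as np

# Illustrative combination of sentence-level and word-level transfer: concatenate the
# USE sentence embedding with a word-level sentence representation (here a simple
# average of pretrained word embeddings) and feed the result to the task classifier.
# Dimensions and the averaging choice are assumptions, not the paper's exact setup.
sentence_embedding = np.random.rand(512)        # stand-in for the USE output
word_embeddings = np.random.rand(10, 300)       # stand-in pretrained word embeddings
word_level_repr = word_embeddings.mean(axis=0)  # word-level sentence representation

combined = np.concatenate([sentence_embedding, word_level_repr])
print(combined.shape)  # (812,) -- this vector would be the task classifier's input
```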
Experiments
- MR: Movie review snippet sentiment on a five star scale (Pang and Lee, 2005).
- CR: Sentiment of sentences mined from customer reviews (Hu and Liu, 2004).
- SUBJ: Subjectivity of sentences from movie reviews and plot summaries (Pang and Lee, 2004).
- MPQA: Phrase level opinion polarity from news data (Wiebe et al., 2005).
- TREC: Fine grained question classification sourced from TREC (Li and Roth, 2002).
- SST: Binary phrase level sentiment classification (Socher et al., 2013).
- STS Benchmark: Semantic textual similarity (STS) between sentence pairs scored by Pearson correlation with human judgments (Cer et al., 2017).
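For the STS Benchmark entry above, the reported number is a Pearson correlation; a tiny sketch of that scoring step, assuming model similarities (e.g., from Eq. (1)) and gold human scores are already available as arrays (the values are placeholders):

```python
import numpy as np
from scipy.stats import pearsonr

predicted = np.array([0.91, 0.42, 0.75, 0.18])  # placeholder model similarities, e.g. from Eq. (1)
gold = np.array([4.8, 1.5, 3.9, 0.7])           # placeholder human ratings on the 0-5 STS scale

correlation, _ = pearsonr(predicted, gold)      # Pearson r is the metric reported for STS
print(correlation)
```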
Experimental findings:
- The Transformer-based USE generally outperforms the DAN-based one
- USE outperforms using only a word-level encoder
- The best results usually come from combining sentence-level and word-level transfer
Table 3 illustrates transfer task performance for varying amounts of training data. We observe that, for smaller quantities of data, sentence level transfer learning can achieve surprisingly good task performance. As the training set size increases, models that do not make use of transfer learning approach the performance of the other models.
Conclusion
The sentence-level USE models outperform word-level transfer on most transfer tasks, especially on small datasets; combining sentence-level and word-level transfer achieves the best accuracy.